*Figure 1. Global Supply Chain Logistics. Image source: Unsplash (free license).*
The dataset used in this project, titled Supply Chain Logistics Problem, was originally published on Kaggle by shivaiyer129 (2013). It contains shipment records with variables such as product weight, unit quantity, plant location, and the number of days shipped early or late. This project aims to analyze shipping performance across multiple manufacturing plants to identify operational inefficiencies and improvement opportunities.
The dataset was imported using the readxl package and
cleaned with janitor::clean_names() to standardize variable
names and ensure consistent formatting throughout the analysis.
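The import-and-clean step described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the Excel file name in the comment is hypothetical, and the raw column names are assumptions chosen so that they clean to the names shown in the output. A small in-memory data frame stands in for the spreadsheet so the example runs on its own.

```r
library(janitor)

# In the report the data came from an Excel file, along the lines of:
#   raw <- readxl::read_excel("supply_chain_logistics.xlsx")  # hypothetical file name
# A tiny stand-in with ASSUMED raw column names keeps this example self-contained.
raw <- data.frame(
  `Order ID`             = c(1447296447, 1447158015),
  `Ship ahead day count` = c(3, 3),
  `Ship Late Day count`  = c(0, 0),
  `Plant Code`           = c("PLANT16", "PLANT16"),
  check.names = FALSE  # keep the spaces and mixed case, as a spreadsheet import would
)

# clean_names() converts every column name to lower snake_case
supply <- clean_names(raw)
names(supply)
```

Standardizing names up front means every later step can refer to columns such as ship_late_day_count without backticks or case mismatches.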
## Rows: 9,215
## Columns: 14
## $ order_id <dbl> 1447296447, 1447158015, 1447138899, 1447363528, 1…
## $ order_date <dttm> 2013-05-26, 2013-05-26, 2013-05-26, 2013-05-26, …
## $ origin_port <chr> "PORT09", "PORT09", "PORT09", "PORT09", "PORT09",…
## $ carrier <chr> "V44_3", "V44_3", "V44_3", "V44_3", "V44_3", "V44…
## $ tpt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ service_level <chr> "CRF", "CRF", "CRF", "CRF", "CRF", "CRF", "CRF", …
## $ ship_ahead_day_count <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ ship_late_day_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ customer <chr> "V55555_53", "V55555_53", "V55555_53", "V55555_53…
## $ product_id <dbl> 1700106, 1700106, 1700106, 1700106, 1700106, 1700…
## $ plant_code <chr> "PLANT16", "PLANT16", "PLANT16", "PLANT16", "PLAN…
## $ destination_port <chr> "PORT09", "PORT09", "PORT09", "PORT09", "PORT09",…
## $ unit_quantity <dbl> 808, 3188, 2331, 847, 2163, 3332, 1782, 427, 1291…
## $ weight <dbl> 14.30, 87.94, 61.20, 16.16, 52.34, 92.80, 46.90, …
After import, the dataset was examined for missing values and variable clarity. To evaluate shipping performance at the plant level, two new variables were created:

- ship_days: the difference between early and late shipment days, providing a quick indicator of schedule adherence.
- efficiency_score: a ratio comparing early shipments to late shipments, allowing a standardized comparison of performance across plants.

This transformation allows for a meaningful, quantitative assessment of supply chain reliability across facilities.
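The two derived variables could be created roughly as shown below. The report does not show the exact formula for efficiency_score, so the ratio used here, with 1 added to both counts to avoid division by zero, is an assumption made for illustration only. The three rows are made-up example values, not records from the dataset.

```r
library(dplyr)

# Made-up example rows using the dataset's column names
supply <- tibble(
  plant_code           = c("PLANT03", "PLANT08", "PLANT09"),
  ship_ahead_day_count = c(3, 0, 6),
  ship_late_day_count  = c(0, 2, 0)
)

supply <- supply %>%
  mutate(
    # Net schedule position: positive means shipped early on balance
    ship_days = ship_ahead_day_count - ship_late_day_count,
    # ASSUMED formula: the +1 terms keep the ratio defined when late days are 0
    efficiency_score = (ship_ahead_day_count + 1) / (ship_late_day_count + 1)
  )

supply
```

Under this assumed definition, a shipment that is neither early nor late scores 1, early shipments score above 1, and late shipments score below 1.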
## order_id order_date origin_port carrier
## Min. :1.447e+09 Min. :2013-05-26 Length:9215 Length:9215
## 1st Qu.:1.447e+09 1st Qu.:2013-05-26 Class :character Class :character
## Median :1.447e+09 Median :2013-05-26 Mode :character Mode :character
## Mean :1.447e+09 Mean :2013-05-26
## 3rd Qu.:1.447e+09 3rd Qu.:2013-05-26
## Max. :1.447e+09 Max. :2013-05-26
## tpt service_level ship_ahead_day_count ship_late_day_count
## Min. :0.000 Length:9215 Min. :0.000 Min. :0.00000
## 1st Qu.:1.000 Class :character 1st Qu.:0.000 1st Qu.:0.00000
## Median :2.000 Mode :character Median :3.000 Median :0.00000
## Mean :1.718 Mean :1.852 Mean :0.03993
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:0.00000
## Max. :4.000 Max. :6.000 Max. :6.00000
## customer product_id plant_code destination_port
## Length:9215 Min. :1613321 Length:9215 Length:9215
## Class :character 1st Qu.:1669702 Class :character Class :character
## Mode :character Median :1683636 Mode :character Mode :character
## Mean :1680536
## 3rd Qu.:1689554
## Max. :1702654
## unit_quantity weight
## Min. : 235 Min. : 0.000
## 1st Qu.: 330 1st Qu.: 1.407
## Median : 477 Median : 4.440
## Mean : 3203 Mean : 19.872
## 3rd Qu.: 1276 3rd Qu.: 13.326
## Max. :561847 Max. :2338.405
## # A tibble: 7 × 7
## plant_code total_orders avg_units avg_weight avg_ship_ahead avg_ship_late
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 PLANT03 8541 3350. 15.9 1.81 0.0404
## 2 PLANT04 1 348 2.10 3 0
## 3 PLANT08 102 2716. 14.5 0.706 0.225
## 4 PLANT09 12 18652. 33.9 4.58 0
## 5 PLANT12 300 373. 27.2 2.53 0
## 6 PLANT13 86 504. 63.1 2.20 0
## 7 PLANT16 173 1417. 186. 2.97 0
## # ℹ 1 more variable: avg_efficiency <dbl>
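A plant-level table with the columns shown above can be produced with a grouped summary. The sketch below is one plausible way to do it, not necessarily the report's code; the input rows are made-up values, and efficiency_score is assumed to exist already as a column.

```r
library(dplyr)

# Made-up shipment rows; efficiency_score is assumed to be precomputed
supply <- tibble(
  plant_code           = c("PLANT03", "PLANT03", "PLANT08", "PLANT08"),
  unit_quantity        = c(300, 500, 1000, 2000),
  weight               = c(2, 4, 10, 20),
  ship_ahead_day_count = c(3, 1, 0, 2),
  ship_late_day_count  = c(0, 0, 1, 0),
  efficiency_score     = c(4, 2, 0.5, 3)
)

# One row per plant: order count plus the mean of each metric
plant_summary <- supply %>%
  group_by(plant_code) %>%
  summarise(
    total_orders   = n(),
    avg_units      = mean(unit_quantity),
    avg_weight     = mean(weight),
    avg_ship_ahead = mean(ship_ahead_day_count),
    avg_ship_late  = mean(ship_late_day_count),
    avg_efficiency = mean(efficiency_score),
    .groups = "drop"
  )

plant_summary
```

Keeping total_orders next to the averages makes it easy to see how many shipments each plant's mean is based on.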
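The Shipping Efficiency by Plant bar chart shows that PLANT09 has the highest efficiency score, indicating stronger on-time delivery performance than the other facilities; its 12 orders shipped an average of 4.58 days ahead with no late days. In contrast, several plants show lower efficiency. PLANT08 is the clearest case, with the lowest average early-ship days (0.706) and the highest average late days (0.225), which suggests recurring fulfillment delays.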
The Ship Early vs. Ship Late scatter plot reinforces this finding. Most shipments cluster near zero late days, while a smaller subset of shipments with high late-day counts likely represents carrier delays, operational bottlenecks, or longer transportation routes.
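The two charts discussed above could be drawn with ggplot2 along these lines. This is a sketch under stated assumptions: the data frames hold made-up placeholder values (the report's per-plant efficiency averages are truncated in the printed table), and the styling choices are illustrative rather than a reconstruction of the original figures.

```r
library(ggplot2)

# Placeholder per-plant averages (NOT values from the report)
plant_summary <- data.frame(
  plant_code     = c("PLANT03", "PLANT08", "PLANT09"),
  avg_efficiency = c(2.8, 1.6, 5.6)
)

# Bar chart: plants ordered by average efficiency score
bar_plot <- ggplot(
  plant_summary,
  aes(x = reorder(plant_code, avg_efficiency), y = avg_efficiency)
) +
  geom_col() +
  coord_flip() +
  labs(title = "Shipping Efficiency by Plant", x = "Plant", y = "Average efficiency score")

# Placeholder shipment-level counts (NOT records from the dataset)
shipments <- data.frame(
  ship_ahead_day_count = c(3, 3, 0, 6, 2, 0, 1),
  ship_late_day_count  = c(0, 0, 2, 0, 0, 5, 0)
)

# Scatter plot: jitter separates points that share the same integer day counts
scatter_plot <- ggplot(
  shipments,
  aes(x = ship_ahead_day_count, y = ship_late_day_count)
) +
  geom_jitter(width = 0.15, height = 0.15, alpha = 0.5) +
  labs(title = "Ship Early vs. Ship Late", x = "Days shipped early", y = "Days shipped late")
```

Because both day counts are small integers, many shipments overlap exactly; jittering with partial transparency makes the cluster near zero late days visible.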
To explore which factors affect efficiency, a multiple linear regression was fitted with unit quantity, shipping weight, and late-day count as the independent variables and efficiency score as the dependent variable. This model helps identify how these features relate to performance.
##
## Call:
## lm(formula = efficiency_score ~ unit_quantity + weight + ship_late_day_count,
## data = drop_na(supply))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7627 -1.8517 -0.0872 1.1495 4.0005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.854e+00 2.096e-02 136.182 < 2e-16 ***
## unit_quantity -1.092e-05 1.323e-06 -8.258 < 2e-16 ***
## weight 1.454e-03 3.173e-04 4.581 4.69e-06 ***
## ship_late_day_count -7.618e-01 6.205e-02 -12.278 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.903 on 9211 degrees of freedom
## Multiple R-squared: 0.02344, Adjusted R-squared: 0.02313
## F-statistic: 73.71 on 3 and 9211 DF, p-value: < 2.2e-16
A multiple linear regression model was used to examine the factors influencing shipping efficiency. The results indicate the following:

- Late-day count has the strongest negative association with efficiency (Estimate = -0.76, p < 0.001): each additional late day corresponds to a drop of about 0.76 in the efficiency score, holding the other predictors fixed.
- Weight has a small positive relationship with efficiency (Estimate ≈ 0.0015, p < 0.001).
- Unit quantity shows a slight negative relationship (Estimate ≈ -1.1e-05, p < 0.001).
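Although the model explains only about 2.3% of the variance in efficiency score (R² = 0.023), the direction of the effects is still informative: of the three predictors, late-day count has the largest association with efficiency, which suggests that reducing late shipments is the most direct lever for improving efficiency across plants.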
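The coefficient estimates and R² discussed above can be read from a fitted model object as sketched below. The model formula matches the call shown in the regression output, but the data here are simulated purely so the example runs on its own; the simulation settings are assumptions, and the resulting numbers will not match the report's.

```r
set.seed(42)
n <- 200

# Simulated predictors (ranges are illustrative assumptions)
supply <- data.frame(
  unit_quantity       = round(runif(n, 235, 5000)),
  weight              = runif(n, 0, 100),
  ship_late_day_count = rbinom(n, size = 6, prob = 0.05)
)

# Simulated response with effect directions loosely based on the reported ones
supply$efficiency_score <- 2.85 -
  1e-05  * supply$unit_quantity +
  0.0015 * supply$weight -
  0.76   * supply$ship_late_day_count +
  rnorm(n, sd = 1.9)

model <- lm(
  efficiency_score ~ unit_quantity + weight + ship_late_day_count,
  data = supply
)

fit <- summary(model)
coefs     <- coef(fit)     # matrix with Estimate, Std. Error, t value, Pr(>|t|)
r_squared <- fit$r.squared # share of variance explained by the model

coefs
r_squared
```

Reading the estimates alongside R² keeps two questions separate: how strongly each predictor is associated with the score, and how much of the overall variation the model accounts for.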
This analysis indicates that shipping performance varies considerably across plants, with PLANT09 demonstrating notably higher reliability. The regression results highlight the impact of operational delays on efficiency, reinforcing the importance of minimizing late shipments to maintain performance standards.
However, several limitations should be noted. The dataset does not include transportation distance, and because every order shares a single order date (2013-05-26), seasonal demand patterns cannot be assessed. A carrier identifier is present but was not used in the regression. Any of these factors could influence delivery timelines. Future analyses could incorporate forecasting models and carrier-level comparisons to improve predictive accuracy and operational planning.
*Figure 2. Joshua Grant, M.S. Data Science & Analytics Student, Clemson University.*
I am Joshua Grant, and I am currently pursuing a Master’s in Data Science and Analytics at Clemson University, building a strong foundation in statistical modeling, machine learning, and data visualization. With a passion for transforming raw data into actionable insights, I thrive on solving complex problems and uncovering trends that drive smart decision-making. This project demonstrates how data analytics can improve supply chain decision-making across multiple plants.